This is the example script (partly modified) from "Explainable artificial intelligence enhances the ecological interpretability of black-box species distribution models" by Ryo et al. The original code can be found here
The species occurrence and climate data are taken from GBIF (https://www.gbif.org/) using the {sdmbench} R package
library(sf)
library(mapview)
library(sp)
library(maptools)
library(sdmbench)
library(mlr)
library(iml)
library(lime)
library(dplyr)
library(ggplot2)
library(rsample)
library(withr)
library(RStoolbox)
library(DT)
dictionary = read.csv("worldclim_dictionary.csv")
Campanula morettiana is a rare species that lives on calcareous rocks. Mesoclimate and microclimate, as well as topography, play a crucial role in driving its presence.
species = "Campanula morettiana"
climate_resolution = 2.5
# Get and prepare data ----------------------------------------------------
set.seed(42)
# This function downloads and stores the data. To make the rest of the script reproducible (GBIF data can be updated with new observations), we load a stored static dataset.
occ_data_raw <-
get_benchmarking_data(species, limit = 1000, climate_resolution = climate_resolution)
## [1] "Getting benchmarking data...."
## [1] "Cleaning benchmarking data...."
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\GGenova\AppData\Local\Temp\RtmpEfq5x8", layer: "ne_50m_land"
## with 1420 features
## It has 3 fields
## Integer64 fields read as strings: scalerank
## [1] "Done!"
colnames(occ_data_raw$df_data) = c(dictionary$Abbreviation,"label")
names(occ_data_raw$raster_data$climate_variables) <- dictionary$Abbreviation
occ_data <- occ_data_raw$df_data
occ_data$label <- as.factor(occ_data$label)
coordinates_df <- rbind(occ_data_raw$raster_data$coords_presence, occ_data_raw$raster_data$background)
occ_data <- normalizeFeatures(occ_data, method = "standardize")
occ_data <- cbind(occ_data, coordinates_df)
occ_data <- na.omit(occ_data)
occ_sf <- st_as_sf(occ_data,coords = c("x", "y"), crs = 4326, agr = "constant")
occ_sf = occ_sf %>% filter(label == 1)
mapview() + mapview(occ_sf, zcol = "label")
set.seed(42)
train_test_split <- initial_split(occ_data, prop = 0.7)
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
data_train$x <- NULL
data_train$y <- NULL
data_test_subset <- data_test %>% filter(label == 1)
Train the Random Forest model
task <-
makeClassifTask(id = "model", data = data_train, target = "label")
lrn <- makeLearner("classif.randomForest", predict.type = "prob")
mod <- train(lrn, task)
pred <- predict(mod, newdata=data_test)
VIMP <- as.data.frame(getFeatureImportance(mod)$res)
# resampling
sample_data <- withr::with_seed(10, sample_n(data_test_subset, 3))
sample_data_coords <- dplyr::select(sample_data, c("x", "y"))
sample_data$x <- NULL
sample_data$y <- NULL
customPredictFun <- function(model, data) {
v <- predict(model, data, type = "prob")
v <- as.data.frame(v)
colnames(v) <- c("absence", "presence")
return(v$presence)
}
normalized_raster <- RStoolbox::normImage(occ_data_raw$raster_data$climate_variables)
pr <-
dismo::predict(normalized_raster,
mlr::getLearnerModel(mod, TRUE),
fun = customPredictFun)
coordinates(sample_data_coords) <- ~ x + y
sl1 <- list("sp.points", sample_data_coords, pch=1, cex=1.2, lwd=2, col="white")
sl2 <- list("sp.pointLabel", sample_data_coords,
label=c("Case 1","Case 2","Case 3"),
cex=0.9, col="white", fontface=2)
rf_map <-
spplot(pr, main = "Habitat Suitability Map",
scales = list(draw = TRUE),
sp.layout = list(sl1,sl2),
labels=TRUE
)
rf_map
# Top n important variables
top_n(VIMP, n=5, importance) %>%
ggplot(., aes(x=reorder(variable,importance), y=importance))+
geom_bar(stat='identity')+ coord_flip() + xlab("")
# Performance
performance(pred, measures=auc)
## auc
## 0.9792238
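AUC alone can hide class-specific behaviour. As a hedged sketch reusing the `pred` object created above, mlr's performance() also accepts a list of measures, so accuracy and the Brier score can be reported alongside the AUC:

```r
# Evaluate several measures at once on the held-out test predictions.
# acc = accuracy, brier = Brier score (lower is better), auc = area under the ROC curve
library(mlr)
performance(pred, measures = list(auc, acc, brier))
```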
Accumulated local effects (ALE) and partial dependence plots (PDPs) both show the average model prediction over a feature. The difference is that ALE curves are computed as accumulated differences over the conditional distribution, whereas partial dependence plots average over the marginal distribution. ALE plots are preferable to PDPs because they are faster to compute and remain unbiased when features are correlated.
ALE plots for categorical features are automatically ordered by the similarity of the categories based on the distribution of the other features for instances in a category. When the feature is an ordered factor, the ALE plot leaves the order as is.
predictor <-
Predictor$new(mod, data = data_train, class = 1, y = "label")
ale <- FeatureEffect$new(predictor, feature = "Temp_seasonality")
ale$plot() +
theme_minimal() +
ggtitle("ALE Feature Effect") +
xlab("Temperature seasonality")
predictor <-
Predictor$new(mod, data = data_train, class = 1, y = "label")
ale <- FeatureEffect$new(predictor, feature = "Annual_precip")
ale$plot() +
theme_minimal() +
ggtitle("ALE Feature Effect") +
xlab("Annual precipitation")
Once an explainer has been created using the lime() function, it can be used to explain the result of the model on new observations. The explain() function takes new observations along with the explainer and returns a data.frame with prediction explanations, one observation per row. The returned explanations can then be visualised in a number of ways, e.g. with plot_features().
We select three predictions at random
set.seed(42)
explainer <- lime(data_train, mod)
set.seed(42)
explanation <-
lime::explain(sample_data,
explainer,
n_labels = 1,
n_features = 5)
plot_features(explanation, ncol = 1) + xlab(NULL)
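Besides plot_features(), {lime} also provides plot_explanations(), which arranges the same feature weights as a heat map. This view scales better when comparing many cases at once; a sketch reusing the `explanation` data frame from above:

```r
# Heat-map view of the same LIME explanations across the sampled cases.
library(lime)
plot_explanations(explanation)
```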
An example of an alpine grasshopper
For the next species we will only change the following lines of code:
species = "Miramella alpina"
climate_resolution = 10
Train the Random Forest model
## auc
## 0.9515574
A Mediterranean plant species
For the next species we will only change the following lines of code:
species = "Asparagus acutifolius"
climate_resolution = 10
Train the Random Forest model
## auc
## 0.9284497
A butterfly endemic to the Alpine region
For the next species we will only change the following lines of code:
species = "Plebejus trappi"
climate_resolution = 2.5
Train the Random Forest model
## auc
## 0.6846154
DT::datatable(dictionary,rownames = FALSE)